Data Science Foundation - Assignment on Facebook Analysis
Analysis Attempted By: Mohit Dewan
Domain: Facebook - analysis of social media usage.
Project Overview: Facebook is a popular free social networking website that allows registered users to create profiles, upload photos and videos, send messages, and keep in touch with friends, family, and colleagues. A large amount of data is generated on Facebook daily. The emphasis here is on examining various trends, e.g. friend count, tenure, and date of birth, and the distribution of these parameters in our data.
1. Introduction
2. Problem Statement
3. Installing & Importing Libraries
4. Data Acquisition & Description
5. Data Pre-profiling
6. Data Pre-Processing
7. Data Post-profiling
8. Exploratory Data Analysis
9. Summarization
Facebook is an online social media and social networking service owned by the American technology giant Meta Platforms.
It was founded in 2004 by Mark Zuckerberg, Eduardo Saverin, Andrew McCollum, Dustin Moskovitz, and Chris Hughes.
Initially designed for college students, it quickly grew in popularity and is now used by people all over the world.
As of 2020, Facebook was the **second most valuable media brand worldwide**.
Facebook wants to expand its business globally in the near future.
As of December 2022, Facebook claimed **2.96 billion** monthly active users and ranked third worldwide among the most visited websites.
Facebook hired a team of data scientists to analyze the dataset provided and to identify patterns in how users make use of this most popular social networking app, depending on their age group, gender, etc.
Using EDA on the dataset, what insights can the data analysis outcome showcase for the business?
!pip install -q datascience
!pip install -q --upgrade ydata-profiling  # pandas-profiling has been renamed to ydata-profiling
!pip install -q --upgrade yellowbrick
!pip install --upgrade scikit-learn
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: scikit-learn in c:\programdata\anaconda3\lib\site-packages (1.2.1)
Collecting scikit-learn
Downloading scikit_learn-1.2.2-cp310-cp310-win_amd64.whl (8.3 MB)
---------------------------------------- 8.3/8.3 MB 20.4 MB/s eta 0:00:00
Requirement already satisfied: numpy>=1.17.3 in c:\programdata\anaconda3\lib\site-packages (from scikit-learn) (1.23.5)
Requirement already satisfied: scipy>=1.3.2 in c:\users\pankaj dewan\appdata\roaming\python\python310\site-packages (from scikit-learn) (1.9.3)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\programdata\anaconda3\lib\site-packages (from scikit-learn) (2.2.0)
Requirement already satisfied: joblib>=1.1.1 in c:\programdata\anaconda3\lib\site-packages (from scikit-learn) (1.1.1)
Installing collected packages: scikit-learn
Successfully installed scikit-learn-1.2.2
import numpy as np # For numerical python operations
#-------------------------------------------------------------------------------------------------------------------------------
import pandas as pd # Importing for panel data analysis
#-------------------------------------------------------------------------------------------------------------------------------
import matplotlib.pyplot as plt # Importing pyplot interface using matplotlib
import seaborn as sns # Importing seaborn library for interactive visualization
%matplotlib inline
#-------------------------------------------------------------------------------------------------------------------------------
import warnings # Importing warning to disable runtime warnings
warnings.filterwarnings("ignore") # Suppress runtime warnings
| Id | Features | Description |
|---|---|---|
| 01 | UserID | A numeric value uniquely identifying the user. |
| 02 | Age | Age of the user in years. |
| 03 | DOB_Day | Day part of the user's date of birth. |
| 04 | DOB_Year | Year part of the user's date of birth. |
| 05 | DOB_Month | Month part of the user's date of birth. |
| 06 | Gender | Gender of the user. |
| 07 | Tenure | Number of days since the user has been on FB. |
| 08 | Friend Count | Number of friends the user has. |
| 09 | friendships_initiated | Number of friendships initiated by the user. |
| 10 | likes | Total number of posts liked by the user. |
| 11 | likes received | Total Number of likes received by user's posts. |
| 12 | mobile likes | Number of posts liked by the user through mobile app. |
| 13 | mobile likes received | Number of likes received by user through mobile app. |
| 14 | www likes | Number of posts liked by the user through web. |
| 15 | www likes received | Number of likes received by user through web. |
# Importing and Reading the data
fb_df = pd.read_csv(r"E:\fb_data.csv")
fb_df
print('Data Shape: ', fb_df.shape)
Data Shape: (99003, 15)
fb_df.head(5)
| userid | age | dob_day | dob_year | dob_month | gender | tenure | friend_count | friendships_initiated | likes | likes_received | mobile_likes | mobile_likes_received | www_likes | www_likes_received | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2094382 | 14 | 19 | 1999 | 11 | male | 266.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1192601 | 14 | 2 | 1999 | 11 | female | 6.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2083884 | 14 | 16 | 1999 | 11 | male | 13.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1203168 | 14 | 25 | 1999 | 12 | female | 93.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1733186 | 14 | 4 | 1999 | 12 | male | 82.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
fb_df.describe()
| userid | age | dob_day | dob_year | dob_month | tenure | friend_count | friendships_initiated | likes | likes_received | mobile_likes | mobile_likes_received | www_likes | www_likes_received | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 9.900300e+04 | 99003.000000 | 99003.000000 | 99003.000000 | 99003.000000 | 99001.000000 | 99003.000000 | 99003.000000 | 99003.000000 | 99003.000000 | 99003.000000 | 99003.000000 | 99003.000000 | 99003.000000 |
| mean | 1.597045e+06 | 37.280224 | 14.530408 | 1975.719776 | 6.283365 | 537.887375 | 196.350787 | 107.452471 | 156.078785 | 142.689363 | 106.116300 | 84.120491 | 49.962425 | 58.568831 |
| std | 3.440592e+05 | 22.589748 | 9.015606 | 22.589748 | 3.529672 | 457.649874 | 387.304229 | 188.786951 | 572.280681 | 1387.919613 | 445.252985 | 839.889444 | 285.560152 | 601.416348 |
| min | 1.000008e+06 | 13.000000 | 1.000000 | 1900.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 1.298806e+06 | 20.000000 | 7.000000 | 1963.000000 | 3.000000 | 226.000000 | 31.000000 | 17.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 1.596148e+06 | 28.000000 | 14.000000 | 1985.000000 | 6.000000 | 412.000000 | 82.000000 | 46.000000 | 11.000000 | 8.000000 | 4.000000 | 4.000000 | 0.000000 | 2.000000 |
| 75% | 1.895744e+06 | 50.000000 | 22.000000 | 1993.000000 | 9.000000 | 675.000000 | 206.000000 | 117.000000 | 81.000000 | 59.000000 | 46.000000 | 33.000000 | 7.000000 | 20.000000 |
| max | 2.193542e+06 | 113.000000 | 31.000000 | 2000.000000 | 12.000000 | 3139.000000 | 4923.000000 | 4144.000000 | 25111.000000 | 261197.000000 | 25111.000000 | 138561.000000 | 14865.000000 | 129953.000000 |
fb_df.describe(include='all')
| userid | age | dob_day | dob_year | dob_month | gender | tenure | friend_count | friendships_initiated | likes | likes_received | mobile_likes | mobile_likes_received | www_likes | www_likes_received | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 9.900300e+04 | 99003.000000 | 99003.000000 | 99003.000000 | 99003.000000 | 98828 | 99001.000000 | 99003.000000 | 99003.000000 | 99003.000000 | 99003.000000 | 99003.000000 | 99003.000000 | 99003.000000 | 99003.000000 |
| unique | NaN | NaN | NaN | NaN | NaN | 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| top | NaN | NaN | NaN | NaN | NaN | male | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | NaN | NaN | NaN | NaN | NaN | 58574 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | 1.597045e+06 | 37.280224 | 14.530408 | 1975.719776 | 6.283365 | NaN | 537.887375 | 196.350787 | 107.452471 | 156.078785 | 142.689363 | 106.116300 | 84.120491 | 49.962425 | 58.568831 |
| std | 3.440592e+05 | 22.589748 | 9.015606 | 22.589748 | 3.529672 | NaN | 457.649874 | 387.304229 | 188.786951 | 572.280681 | 1387.919613 | 445.252985 | 839.889444 | 285.560152 | 601.416348 |
| min | 1.000008e+06 | 13.000000 | 1.000000 | 1900.000000 | 1.000000 | NaN | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 1.298806e+06 | 20.000000 | 7.000000 | 1963.000000 | 3.000000 | NaN | 226.000000 | 31.000000 | 17.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 1.596148e+06 | 28.000000 | 14.000000 | 1985.000000 | 6.000000 | NaN | 412.000000 | 82.000000 | 46.000000 | 11.000000 | 8.000000 | 4.000000 | 4.000000 | 0.000000 | 2.000000 |
| 75% | 1.895744e+06 | 50.000000 | 22.000000 | 1993.000000 | 9.000000 | NaN | 675.000000 | 206.000000 | 117.000000 | 81.000000 | 59.000000 | 46.000000 | 33.000000 | 7.000000 | 20.000000 |
| max | 2.193542e+06 | 113.000000 | 31.000000 | 2000.000000 | 12.000000 | NaN | 3139.000000 | 4923.000000 | 4144.000000 | 25111.000000 | 261197.000000 | 25111.000000 | 138561.000000 | 14865.000000 | 129953.000000 |
fb_df.nunique() # Returns the number of unique values in each column
userid                   99003
age                        101
dob_day                     31
dob_year                   101
dob_month                   12
gender                       2
tenure                    2426
friend_count              2562
friendships_initiated     1519
likes                     2924
likes_received            2681
mobile_likes              2396
mobile_likes_received     2004
www_likes                 1726
www_likes_received        1636
dtype: int64
fb_df.columns # Shows the headers of all columns
Index(['userid', 'age', 'dob_day', 'dob_year', 'dob_month', 'gender', 'tenure',
'friend_count', 'friendships_initiated', 'likes', 'likes_received',
'mobile_likes', 'mobile_likes_received', 'www_likes',
'www_likes_received'],
dtype='object')
Missing Values
Sometimes the data contains missing values or inconsistencies that create problems later during analysis, so it is important to handle missing (null) values first.
There are normally two approaches:
Either delete the entire row containing missing data, or fill it with the mean, median, or mode of the values present in the column. The second approach is usually preferable, because deleting rows loses data, and when analyzing or training it is good to have as much data as possible for better results.
The code below shows how many null values are present in each column of the data frame.
Gender: 175 Null values
Tenure: 2 Null values
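The two fill strategies described above (mode for a categorical column, median for a numeric column) can be sketched on a small, hypothetical frame; the column names here merely mirror the dataset:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing categorical value and one missing numeric value.
toy = pd.DataFrame({
    "gender": ["male", "female", np.nan, "male"],
    "tenure": [100.0, np.nan, 300.0, 500.0],
})

# Categorical column: fill with the mode (most frequent value).
toy["gender"] = toy["gender"].fillna(toy["gender"].mode()[0])

# Numeric column: fill with the median, which is robust to outliers.
toy["tenure"] = toy["tenure"].fillna(toy["tenure"].median())

print(toy.isnull().sum().sum())  # 0 — no missing values remain
```

This is exactly the pattern applied to the real `gender` and `tenure` columns in the cells that follow.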
pd.isnull(fb_df["gender"]).sum()
175
fb_df.columns
Index(['userid', 'age', 'dob_day', 'dob_year', 'dob_month', 'gender', 'tenure',
'friend_count', 'friendships_initiated', 'likes', 'likes_received',
'mobile_likes', 'mobile_likes_received', 'www_likes',
'www_likes_received'],
dtype='object')
fb_df['gender'].value_counts()
male      58574
female    40254
Name: gender, dtype: int64
fb_df['gender']=fb_df['gender'].fillna(fb_df['gender'].mode()[0])
fb_df['gender'].value_counts()
male      58749
female    40254
Name: gender, dtype: int64
pd.isnull(fb_df["gender"]).sum()
0
import matplotlib.pyplot as plt
import seaborn as sns
sns.distplot(fb_df.userid)
plt.show()
fb_df.userid.quantile([0.0])[0]
1000008.0
fb_df.userid.quantile([0.25])
0.25    1298805.5
Name: userid, dtype: float64
fb_df.userid.quantile([0.25]).index
Float64Index([0.25], dtype='float64')
fb_df.userid.quantile([0.25]).values
array([1298805.5])
sns.distplot(fb_df.userid)
plt.axvline(fb_df.userid.quantile([0.0])[0], color='r')
plt.axvline(fb_df.userid.quantile([0.25])[0.25], color='r')
plt.axvline(fb_df.userid.quantile([0.50])[0.50], color='r') # median
plt.axvline(fb_df.userid.quantile([0.75])[0.75], color='r')
plt.axvline(fb_df.userid.quantile([1.0])[1.0], color='r')
plt.show()
median_value = fb_df['tenure'].median()
print (median_value)
412.0
fb_df['tenure'].fillna(value=median_value, inplace=True)
pd.isnull(fb_df["tenure"]).sum()
0
fb_df.isnull().sum() # to check if all missing values are handled
userid                   0
age                      0
dob_day                  0
dob_year                 0
dob_month                0
gender                   0
tenure                   0
friend_count             0
friendships_initiated    0
likes                    0
likes_received           0
mobile_likes             0
mobile_likes_received    0
www_likes                0
www_likes_received       0
dtype: int64
fb_df.info() #This function gives information about the columns type
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99003 entries, 0 to 99002
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   userid                 99003 non-null  int64
 1   age                    99003 non-null  int64
 2   dob_day                99003 non-null  int64
 3   dob_year               99003 non-null  int64
 4   dob_month              99003 non-null  int64
 5   gender                 99003 non-null  object
 6   tenure                 99003 non-null  float64
 7   friend_count           99003 non-null  int64
 8   friendships_initiated  99003 non-null  int64
 9   likes                  99003 non-null  int64
 10  likes_received         99003 non-null  int64
 11  mobile_likes           99003 non-null  int64
 12  mobile_likes_received  99003 non-null  int64
 13  www_likes              99003 non-null  int64
 14  www_likes_received     99003 non-null  int64
dtypes: float64(1), int64(13), object(1)
memory usage: 11.3+ MB
The info function gives us the following insights into the dataframe:
- There are a total of 99003 samples (rows) and 15 columns in the dataframe.
- There are 13 columns with an integer datatype.
- There is one float column (tenure) and one object (categorical) column (gender).
- There are no missing values remaining after the imputation above.
fb_df.isna().sum().sort_values(ascending=False)
userid                   0
age                      0
dob_day                  0
dob_year                 0
dob_month                0
gender                   0
tenure                   0
friend_count             0
friendships_initiated    0
likes                    0
likes_received           0
mobile_likes             0
mobile_likes_received    0
www_likes                0
www_likes_received       0
dtype: int64
fb_df.head(5)
| userid | age | dob_day | dob_year | dob_month | gender | tenure | friend_count | friendships_initiated | likes | likes_received | mobile_likes | mobile_likes_received | www_likes | www_likes_received | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2094382 | 14 | 19 | 1999 | 11 | male | 266.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1192601 | 14 | 2 | 1999 | 11 | female | 6.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2083884 | 14 | 16 | 1999 | 11 | male | 13.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1203168 | 14 | 25 | 1999 | 12 | female | 93.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1733186 | 14 | 4 | 1999 | 12 | male | 82.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
fb_df.gender.unique()
array(['male', 'female'], dtype=object)
# Check each column's non-null count, size, and number of unique values; a column whose values are all unique is a candidate index column
fb_df.agg(['count', 'size', 'nunique'])
| userid | age | dob_day | dob_year | dob_month | gender | tenure | friend_count | friendships_initiated | likes | likes_received | mobile_likes | mobile_likes_received | www_likes | www_likes_received | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 99003 | 99003 | 99003 | 99003 | 99003 | 98828 | 99001 | 99003 | 99003 | 99003 | 99003 | 99003 | 99003 | 99003 | 99003 |
| size | 99003 | 99003 | 99003 | 99003 | 99003 | 99003 | 99003 | 99003 | 99003 | 99003 | 99003 | 99003 | 99003 | 99003 | 99003 |
| nunique | 99003 | 101 | 31 | 101 | 12 | 2 | 2426 | 2562 | 1519 | 2924 | 2681 | 2396 | 2004 | 1726 | 1636 |
## Data Structure
## The data frame has 99003 observations and 15 variables.
## Apart from "gender", all variables are integers; "tenure" is a float because 2 of its records were originally null.
import pandas as pd
from ydata_profiling import ProfileReport
profile = fb_df.profile_report(title = 'Pre Profile Facebook Dataset')
profile.to_file(output_file='Pre Profile Facebook Data Analysis.html')
Summarize dataset: 0%| | 0/5 [00:00<?, ?it/s]
Generate report structure: 0%| | 0/1 [00:00<?, ?it/s]
Render HTML: 0%| | 0/1 [00:00<?, ?it/s]
Export report to file: 0%| | 0/1 [00:00<?, ?it/s]
Observations from Pandas Profiling before Data Processing

Dataset info:
- Number of variables: 15
- Number of observations: 99003
- Missing cells: <0.1%
- Variable types: Numeric = 14, Categorical = 1
- The dataset has 175 (0.2%) missing values in the gender column and 2 in the tenure column.

There are zeros in the following columns: friend_count, friendships_initiated, likes, likes_received, mobile_likes, mobile_likes_received, www_likes, www_likes_received.
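The zeros observation above can be quantified directly by counting zero entries per column; a minimal sketch on hypothetical toy values (not the real dataset):

```python
import pandas as pd

# Toy frame standing in for the like/friend columns.
toy = pd.DataFrame({
    "likes": [0, 3, 0],
    "friend_count": [0, 1, 2],
})

# Element-wise comparison produces a boolean frame; summing counts the zeros.
zero_counts = (toy == 0).sum()
print(zero_counts["likes"])  # 2
```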
fb_df['gender'].mode()
0    male
Name: gender, dtype: object
fb_df['gender'] = fb_df['gender'].replace(np.nan, 'male')
fb_df['gender'].unique()
array(['male', 'female'], dtype=object)
fb_df['tenure'].median()
412.0
fb_df['tenure'] = fb_df['tenure'].replace(np.nan, 412.0)
fb_df.isnull().sum().sort_values(ascending = False)
age                      0
dob_day                  0
dob_year                 0
dob_month                0
gender                   0
tenure                   0
friend_count             0
friendships_initiated    0
likes                    0
likes_received           0
mobile_likes             0
mobile_likes_received    0
www_likes                0
www_likes_received       0
dtype: int64
# Before converting the day, month and year columns to a date of birth, check date validity for each column
print("Feature 'DOB_Day' has '{unique_values}' unique values".format(unique_values=np.sort(fb_df["dob_day"].unique())))
print("Feature 'DOB_Month' has '{unique_values}' unique values".format(unique_values=np.sort(fb_df["dob_month"].unique())))
print("Feature 'DOB_Year' has '{unique_values}' unique values".format(unique_values=np.sort(fb_df["dob_year"].unique())))
Feature 'DOB_Day' has '[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31]' unique values
Feature 'DOB_Month' has '[ 1 2 3 4 5 6 7 8 9 10 11 12]' unique values
Feature 'DOB_Year' has '[1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000]' unique values
# As day, month and year column data valid now convert into date type column 'DateOfBirth'
fb_df.insert(1,"DateOfBirth",pd.to_datetime(fb_df.dob_year*10000+fb_df.dob_month*100+fb_df.dob_day,format='%Y%m%d'))
fb_df.head()
| age | DateOfBirth | date_of_birth | dob_day | dob_year | dob_month | gender | tenure | friend_count | friendships_initiated | likes | likes_received | mobile_likes | mobile_likes_received | www_likes | www_likes_received | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14 | 1999-11-19 | 1999-11-19 | 19 | 1999 | 11 | male | 266.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 14 | 1999-11-02 | 1999-11-02 | 2 | 1999 | 11 | female | 6.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 14 | 1999-11-16 | 1999-11-16 | 16 | 1999 | 11 | male | 13.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 14 | 1999-12-25 | 1999-12-25 | 25 | 1999 | 12 | female | 93.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 14 | 1999-12-04 | 1999-12-04 | 4 | 1999 | 12 | male | 82.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
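The arithmetic construction used above (`year*10000 + month*100 + day` parsed with `%Y%m%d`) can also be written more directly: `pd.to_datetime` accepts a frame whose columns are named `year`, `month` and `day`. A small sketch with illustrative values:

```python
import pandas as pd

# Frame whose column names pd.to_datetime recognizes as date components.
parts = pd.DataFrame({"year": [1999, 1985], "month": [11, 6], "day": [19, 3]})

# Assembles one datetime per row without string formatting tricks.
dob = pd.to_datetime(parts)
print(dob.iloc[0])  # 1999-11-19 00:00:00
```

On the real dataset this would require renaming `dob_year`/`dob_month`/`dob_day` first; both approaches produce the same datetime column.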
fb_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99003 entries, 0 to 99002
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   age                    99003 non-null  int64
 1   DateOfBirth            99003 non-null  datetime64[ns]
 2   date_of_birth          99003 non-null  datetime64[ns]
 3   dob_day                99003 non-null  int64
 4   dob_year               99003 non-null  int64
 5   dob_month              99003 non-null  int64
 6   gender                 99003 non-null  object
 7   tenure                 99003 non-null  float64
 8   friend_count           99003 non-null  int64
 9   friendships_initiated  99003 non-null  int64
 10  likes                  99003 non-null  int64
 11  likes_received         99003 non-null  int64
 12  mobile_likes           99003 non-null  int64
 13  mobile_likes_received  99003 non-null  int64
 14  www_likes              99003 non-null  int64
 15  www_likes_received     99003 non-null  int64
dtypes: datetime64[ns](2), float64(1), int64(12), object(1)
memory usage: 12.1+ MB
fb_df.describe(include = 'all')
| age | DateOfBirth | date_of_birth | dob_day | dob_year | dob_month | gender | tenure | friend_count | friendships_initiated | likes | likes_received | mobile_likes | mobile_likes_received | www_likes | www_likes_received | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 99003.000000 | 99003 | 99003 | 99003.000000 | 99003.000000 | 99003.000000 | 99003 | 99003.000000 | 99003.000000 | 99003.000000 | 99003.000000 | 99003.000000 | 99003.000000 | 99003.000000 | 99003.000000 | 99003.000000 |
| unique | NaN | 23151 | 23151 | NaN | NaN | NaN | 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| top | NaN | 1988-01-01 00:00:00 | 1988-01-01 00:00:00 | NaN | NaN | NaN | male | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | NaN | 656 | 656 | NaN | NaN | NaN | 58749 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| first | NaN | 1900-01-01 00:00:00 | 1900-01-01 00:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| last | NaN | 2000-10-27 00:00:00 | 2000-10-27 00:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | 37.280224 | NaN | NaN | 14.530408 | 1975.719776 | 6.283365 | NaN | 537.884832 | 196.350787 | 107.452471 | 156.078785 | 142.689363 | 106.116300 | 84.120491 | 49.962425 | 58.568831 |
| std | 22.589748 | NaN | NaN | 9.015606 | 22.589748 | 3.529672 | NaN | 457.645601 | 387.304229 | 188.786951 | 572.280681 | 1387.919613 | 445.252985 | 839.889444 | 285.560152 | 601.416348 |
| min | 13.000000 | NaN | NaN | 1.000000 | 1900.000000 | 1.000000 | NaN | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 20.000000 | NaN | NaN | 7.000000 | 1963.000000 | 3.000000 | NaN | 226.000000 | 31.000000 | 17.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 28.000000 | NaN | NaN | 14.000000 | 1985.000000 | 6.000000 | NaN | 412.000000 | 82.000000 | 46.000000 | 11.000000 | 8.000000 | 4.000000 | 4.000000 | 0.000000 | 2.000000 |
| 75% | 50.000000 | NaN | NaN | 22.000000 | 1993.000000 | 9.000000 | NaN | 675.000000 | 206.000000 | 117.000000 | 81.000000 | 59.000000 | 46.000000 | 33.000000 | 7.000000 | 20.000000 |
| max | 113.000000 | NaN | NaN | 31.000000 | 2000.000000 | 12.000000 | NaN | 3139.000000 | 4923.000000 | 4144.000000 | 25111.000000 | 261197.000000 | 25111.000000 | 138561.000000 | 14865.000000 | 129953.000000 |
profile = fb_df.profile_report(title = 'Post Profile Facebook Data')
profile.to_file(output_file='Post Profile Facebook Data Analysis after Processing.html')
Summarize dataset: 0%| | 0/5 [00:00<?, ?it/s]
Generate report structure: 0%| | 0/1 [00:00<?, ?it/s]
Render HTML: 0%| | 0/1 [00:00<?, ?it/s]
Export report to file: 0%| | 0/1 [00:00<?, ?it/s]
Now that the data has been preprocessed, the dataset no longer contains missing values and we have introduced a new feature named DateOfBirth. The pandas profiling report generated after preprocessing therefore gives more useful insights. You can compare the two reports, i.e. 'Post Profile Facebook Data Analysis after Processing.html' and 'Pre Profile Facebook Data Analysis.html'.
Data Post-Processing
fb_df.drop_duplicates(inplace=True)
fb_df.describe(include='all')
| age | DateOfBirth | date_of_birth | dob_day | dob_year | dob_month | gender | tenure | friend_count | friendships_initiated | likes | likes_received | mobile_likes | mobile_likes_received | www_likes | www_likes_received | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 98995.000000 | 98995 | 98995 | 98995.000000 | 98995.000000 | 98995.000000 | 98995 | 98995.000000 | 98995.000000 | 98995.000000 | 98995.000000 | 98995.000000 | 98995.000000 | 98995.000000 | 98995.000000 | 98995.000000 |
| unique | NaN | 23151 | 23151 | NaN | NaN | NaN | 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| top | NaN | 1988-01-01 00:00:00 | 1988-01-01 00:00:00 | NaN | NaN | NaN | male | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | NaN | 651 | 651 | NaN | NaN | NaN | 58741 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| first | NaN | 1900-01-01 00:00:00 | 1900-01-01 00:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| last | NaN | 2000-10-27 00:00:00 | 2000-10-27 00:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | 37.281146 | NaN | NaN | 14.531502 | 1975.718854 | 6.283792 | NaN | 537.916986 | 196.366625 | 107.461124 | 156.091399 | 142.700894 | 106.124875 | 84.127289 | 49.966463 | 58.573564 |
| std | 22.590414 | NaN | NaN | 9.015150 | 22.590414 | 3.529495 | NaN | 457.646219 | 387.315871 | 188.792125 | 572.302084 | 1387.975100 | 445.269954 | 839.923040 | 285.571337 | 601.440418 |
| min | 13.000000 | NaN | NaN | 1.000000 | 1900.000000 | 1.000000 | NaN | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 20.000000 | NaN | NaN | 7.000000 | 1963.000000 | 3.000000 | NaN | 226.000000 | 31.000000 | 17.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 28.000000 | NaN | NaN | 14.000000 | 1985.000000 | 6.000000 | NaN | 412.000000 | 82.000000 | 46.000000 | 11.000000 | 8.000000 | 4.000000 | 4.000000 | 0.000000 | 2.000000 |
| 75% | 50.000000 | NaN | NaN | 22.000000 | 1993.000000 | 9.000000 | NaN | 675.000000 | 206.000000 | 117.000000 | 81.000000 | 59.000000 | 46.000000 | 33.000000 | 7.000000 | 20.000000 |
| max | 113.000000 | NaN | NaN | 31.000000 | 2000.000000 | 12.000000 | NaN | 3139.000000 | 4923.000000 | 4144.000000 | 25111.000000 | 261197.000000 | 25111.000000 | 138561.000000 | 14865.000000 | 129953.000000 |
labels = ['12-14', '15-20','21-30','31-40','41-50', '51-60', '61-70', '71-80','81-90','91-100','101-110','111-120']
fb_df['age_group'] = pd.cut(fb_df['age'],
[10,15,20,30,40,50,60,70,80,90,100,110,120],
labels= labels, include_lowest=True)
sns.set_style('whitegrid')
plt.figure(figsize=(10,5))
sns.countplot(x='age_group',data=fb_df)
<AxesSubplot: xlabel='age_group', ylabel='count'>
sns.set_style('whitegrid')
plt.figure(figsize=(20,12))
plt.xticks(rotation=45)
sns.countplot(x='age',data=fb_df)
<AxesSubplot: xlabel='age', ylabel='count'>
fb_df.groupby(['age_group'])['age_group'].count()
age_group
12-14       5027
15-20      19725
21-30      28639
31-40      12490
41-50       8968
51-60       9319
61-70       6855
71-80       2249
81-90        817
91-100      1219
101-110     3449
111-120      238
Name: age_group, dtype: int64
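For reference, the `pd.cut` semantics used for the age groups above: bins are right-closed by default (a boundary value falls into the lower bin), and `include_lowest=True` keeps the left edge of the first bin. A minimal sketch with illustrative ages, bins and labels:

```python
import pandas as pd

# Toy ages spanning the first few bin edges.
ages = pd.Series([13, 15, 16, 30])

# Right-closed bins: 15 falls into (10, 15], 16 into (15, 20].
groups = pd.cut(ages, [10, 15, 20, 30],
                labels=["10-15", "16-20", "21-30"], include_lowest=True)
print(list(groups))  # ['10-15', '10-15', '16-20', '21-30']
```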
fb_df['gender'].value_counts().plot(kind='pie',explode=[0.02,0.02],fontsize=14, autopct='%1.1f%%',
figsize=(8,8), shadow=True, startangle=135, legend=True, cmap='summer')
plt.title('Pie chart showing the Gender wise Facebook Users Status')
Text(0.5, 1.0, 'Pie chart showing the Gender wise Facebook Users Status')
# The following analysis is done with respect to age group and gender
labels = ['11-20', '21-30', '31-40', '41-50', '51-60', '61-70', '71-80','81-90','91-100','101-110','111-120']
fb_df['age_group'] = pd.cut(fb_df.age, [10,20,30,40,50,60,70,80,90,100,110,120], right=True, labels=labels)
fb_df['age_group'] = fb_df['age_group'].astype('category')
facebook = fb_df.copy()
sns.set()
groupByAgeGroupAndGender = fb_df.groupby(['age_group','gender']).size().reset_index(name='counts')
with sns.axes_style('white'):
    # Assign the FacetGrid to a separate variable so fb_df is not overwritten
    g = sns.catplot(x="age_group", hue='gender', data=fb_df, kind="count", height=8, color='blue', aspect=2)
    sns.pointplot(x='age_group', y='counts', data=groupByAgeGroupAndGender, color='r')
    plt.title("Distribution of genderwise age group against user count")
    plt.ylabel('count of facebook users')
    plt.xlabel('age group')
    plt.show()
facebook = facebook.reset_index()
var = 'age'
fig, axes =plt.subplots(4,3, figsize=(15,12))
axes = axes.flatten()
for ax, i in zip(axes,labels):
sns.countplot(y = var, data = facebook[facebook["age_group"] == i] ,ax=ax)
plt.tight_layout()
plt.show()
sns.set(color_codes=True)
plt.figure(figsize=(20,12))
sns.set_palette(sns.color_palette("Set1", n_colors=5, desat=.5))
sns.distplot(fb_df["tenure"])
<AxesSubplot: xlabel='tenure', ylabel='Density'>
fb_df["friend_count"].value_counts()
0 1956
1 1815
2 1116
3 860
5 789
...
2601 1
2921 1
4066 1
2647 1
2002 1
Name: friend_count, Length: 2562, dtype: int64
Percentage_of_friend_count_nill = fb_df['friend_count'].value_counts().max() / (fb_df.friend_count.count())*100
print('Percentage of Zero Friend Count = ', Percentage_of_friend_count_nill.round(decimals=2))
Percentage of Zero Friend Count = 1.98
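The percentage above is computed with `value_counts().max()`, which silently assumes that 0 is the most frequent friend count. A direct, assumption-free alternative is a boolean comparison; a minimal sketch on hypothetical toy values:

```python
import pandas as pd

# Toy friend counts; two of five users have zero friends.
friend_count = pd.Series([0, 0, 3, 10, 25])

# (series == 0) yields booleans; their mean is the fraction of zeros.
pct_zero = (friend_count == 0).mean() * 100
print(round(pct_zero, 2))  # 40.0
```

On the real column this would be `(fb_df['friend_count'] == 0).mean() * 100`, which stays correct even if 0 is not the mode.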
sns.set(color_codes=True)
plt.figure(figsize=(18,9))
sns.set_palette(sns.color_palette("muted"))
sns.set(font_scale=1.5)
sns.distplot(fb_df["friend_count"])
<AxesSubplot: xlabel='friend_count', ylabel='Density'>
plt.figure(figsize=(18,9))
sns.set_palette(sns.color_palette("deep"))
sns.set(font_scale=1.5)
sns.distplot(fb_df["friendships_initiated"])
<AxesSubplot: xlabel='friendships_initiated', ylabel='Density'>
fb_df['friendships_initiated'].value_counts()
0 2991
1 2211
2 1550
3 1355
4 1352
...
1987 1
2012 1
2213 1
1804 1
1524 1
Name: friendships_initiated, Length: 1519, dtype: int64
Percentage_of_friendships_Initiated_nill = fb_df['friendships_initiated'].value_counts().max() / (fb_df.friendships_initiated.count())*100
print('Percentage of Zero Friendships Initiated = ', Percentage_of_friendships_Initiated_nill.round(decimals=2))
Percentage of Zero Friendships Initiated =  3.02
fig,ax =plt.subplots(figsize=(10,8))
sns.set(font_scale=1)
sns.countplot(data = fb_df,x = 'age_group', hue='gender')
plt.title('Age vs Gender')
Text(0.5, 1.0, 'Age vs Gender')
fig = sns.FacetGrid(fb_df,hue='gender',aspect=5)
fig.map(sns.kdeplot,'age',shade=True)
oldest = fb_df['age'].max()
fig.set(xlim=(0,oldest))
fig.add_legend()
plt.title('Age distribution using FacetGrid')
Text(0.5, 1.0, 'Age distribution using FacetGrid')
fig,ax =plt.subplots(figsize=(15,10))
sns.set(font_scale=1)
sns.boxplot(data=fb_df,x='age_group',y='tenure')
<AxesSubplot: xlabel='age_group', ylabel='tenure'>
sns.set(font_scale=1.5)
sns.set_palette(sns.color_palette("Set2", n_colors=5, desat=.5))
sns.catplot(x="age_group", y='tenure',hue ='gender',data=fb_df, kind="bar",height=8, aspect=2)
<seaborn.axisgrid.FacetGrid at 0x1e32d26ea40>
Max_tenure_in_year = (fb_df['tenure'].max())/365
Max_tenure_in_year
8.6
sns.set(font_scale=1.5)
sns.set_palette(sns.color_palette("Set2", n_colors=5, desat=.5))
sns.catplot(x="age_group", y='friend_count',hue ='gender',data=fb_df, kind="bar",height=8, aspect=2)
<seaborn.axisgrid.FacetGrid at 0x1e32c2697e0>
fig,ax =plt.subplots(figsize=(10,5))
sns.set(font_scale=1)
sns.set_palette(sns.color_palette("Set2", n_colors=5, desat=.5))
sns.boxplot(data=fb_df,x='age_group',y='friendships_initiated')
<AxesSubplot: xlabel='age_group', ylabel='friendships_initiated'>
sns.set(font_scale=1.5)
sns.set_palette(sns.color_palette("Set2", n_colors=5, desat=.5))
sns.catplot(x="age_group", y='friendships_initiated',hue ='gender',data=fb_df, kind="bar",height=8, aspect=2)
<seaborn.axisgrid.FacetGrid at 0x1e32b2fb8e0>
Likes = ['likes', 'mobile_likes', 'www_likes', 'likes_received','mobile_likes_received','www_likes_received']
for value in Likes:
sns.set(font_scale=1.5)
sns.set_palette(sns.color_palette("Set2", n_colors=5, desat=.5))
sns.catplot(x="age_group", y=value,hue ='gender',data=fb_df, kind="bar",height=8, aspect=2)
# Share of users with a nonzero value in each column: the most frequent value
# (value_counts().max()) is the zero count, so count - mode count = users with >= 1.
Percentage_of_likes = (fb_df['likes'].count() - fb_df['likes'].value_counts().max()) / fb_df['likes'].count() * 100
Percentage_of_www_likes = (fb_df['www_likes'].count() - fb_df['www_likes'].value_counts().max()) / fb_df['www_likes'].count() * 100
Percentage_of_mobile_likes = (fb_df['mobile_likes'].count() - fb_df['mobile_likes'].value_counts().max()) / fb_df['mobile_likes'].count() * 100
Percentage_of_likes_received = (fb_df['likes_received'].count() - fb_df['likes_received'].value_counts().max()) / fb_df['likes_received'].count() * 100
Percentage_of_www_likes_received = (fb_df['www_likes_received'].count() - fb_df['www_likes_received'].value_counts().max()) / fb_df['www_likes_received'].count() * 100
Percentage_of_mobile_likes_received = (fb_df['mobile_likes_received'].count() - fb_df['mobile_likes_received'].value_counts().max()) / fb_df['mobile_likes_received'].count() * 100
print('Percentage of Likes = ',Percentage_of_likes)
print('Percentage of Mobile Likes = ',Percentage_of_mobile_likes)
print('Percentage of Website Likes = ',Percentage_of_www_likes)
print('Percentage of Likes Received = ',Percentage_of_likes_received)
print('Percentage of Mobile Likes Received = ',Percentage_of_mobile_likes_received)
print('Percentage of website Likes Received = ',Percentage_of_www_likes_received)
Percentage of Likes =  77.47360977827164
Percentage of Mobile Likes =  64.5961917268549
Percentage of Website Likes =  38.389817667558965
Percentage of Likes Received =  75.33208747916561
Percentage of Mobile Likes Received =  69.70048992373353
Percentage of Website Likes Received =  62.76983686044749
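The count-minus-mode trick above works because the most frequent value in each likes column is 0, so subtracting its count leaves the users with at least one like. A minimal sketch of that logic on toy data (`likes` here is a made-up Series, not the real column):

```python
import pandas as pd

# Toy stand-in: 6 of 10 users have zero likes.
likes = pd.Series([0, 0, 0, 0, 0, 0, 3, 1, 7, 2])

# Mode count is the six zeros; subtracting it leaves users with >= 1 like.
pct_with_likes = (likes.count() - likes.value_counts().max()) / likes.count() * 100
print(pct_with_likes)  # 40.0
```

Note this assumes 0 really is the mode; if most users had, say, exactly 1 like, the formula would silently measure something else.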
Not_mobile_likes = 100- Percentage_of_mobile_likes
Not_www_likes = 100 - Percentage_of_www_likes
Not_mobile_likes_received = 100- Percentage_of_mobile_likes_received
Not_www_likes_received = 100 - Percentage_of_www_likes_received
labels = 'Yes','No'
size1 = [Percentage_of_mobile_likes,Not_mobile_likes]
size2 = [Percentage_of_www_likes,Not_www_likes]
size3 = [Percentage_of_mobile_likes_received,Not_mobile_likes_received]
size4 = [Percentage_of_www_likes_received,Not_www_likes_received]
colors = ['lightblue', 'lightcoral']
plt.figure(figsize=(18,10), dpi=1600)
ax1 = plt.subplot2grid((2,2),(0,0))
plt.pie(size1,labels=labels, colors=colors,autopct='%1.1f%%',startangle=90)
plt.title('Users liked the Posts through Mobile App')
ax1 = plt.subplot2grid((2,2),(0,1))
plt.pie(size2,labels=labels, colors=colors,autopct='%1.1f%%',startangle=90)
plt.title('Users liked the Posts through Website')
ax1 = plt.subplot2grid((2,2),(1,0))
plt.pie(size3,labels=labels, colors=colors,autopct='%1.1f%%',startangle=90)
plt.title('Users Received the likes through Mobile App')
ax1 = plt.subplot2grid((2,2),(1,1))
plt.pie(size4,labels=labels, colors=colors,autopct='%1.1f%%',startangle=90)
plt.title('Users Received the likes through Website')
plt.axis('equal')
plt.show()
plt.figure(figsize=(15, 10))
sns.scatterplot(x="friend_count", y="friendships_initiated", hue="gender", data = fb_df)
plt.figure(figsize=(15, 10))
sns.scatterplot(x="friend_count", y="tenure", hue="gender", data = fb_df)
fb_df.corr()  # Correlation matrix
| | age | dob_day | dob_year | dob_month | tenure | friend_count | friendships_initiated | likes | likes_received | mobile_likes | mobile_likes_received | www_likes | www_likes_received |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| age | 1.000000 | 0.034978 | -1.000000 | 0.025109 | 0.462685 | -0.027429 | -0.058084 | -0.013020 | -0.022575 | -0.026725 | -0.024253 | 0.015578 | -0.018228 |
| dob_day | 0.034978 | 1.000000 | -0.034978 | 0.129285 | 0.041759 | 0.021902 | 0.022932 | 0.015948 | 0.001355 | 0.014513 | 0.000485 | 0.009332 | 0.002449 |
| dob_year | -1.000000 | -0.034978 | 1.000000 | -0.025109 | -0.462685 | 0.027429 | 0.058084 | 0.013020 | 0.022575 | 0.026725 | 0.024253 | -0.015578 | 0.018228 |
| dob_month | 0.025109 | 0.129285 | -0.025109 | 1.000000 | 0.029343 | 0.019745 | 0.020009 | 0.014115 | 0.006484 | 0.010372 | 0.006424 | 0.012116 | 0.005992 |
| tenure | 0.462685 | 0.041759 | -0.462685 | 0.029343 | 1.000000 | 0.166230 | 0.133474 | 0.057116 | 0.027739 | 0.028037 | 0.023965 | 0.070747 | 0.030547 |
| friend_count | -0.027429 | 0.021902 | 0.027429 | 0.019745 | 0.166230 | 1.000000 | 0.825846 | 0.298010 | 0.236461 | 0.235650 | 0.232699 | 0.229798 | 0.220726 |
| friendships_initiated | -0.058084 | 0.022932 | 0.058084 | 0.020009 | 0.133474 | 0.825846 | 1.000000 | 0.285584 | 0.175129 | 0.229800 | 0.173801 | 0.214017 | 0.161437 |
| likes | -0.013020 | 0.015948 | 0.013020 | 0.014115 | 0.057116 | 0.298010 | 0.285584 | 1.000000 | 0.327375 | 0.871651 | 0.329257 | 0.644959 | 0.295686 |
| likes_received | -0.022575 | 0.001355 | 0.022575 | 0.006484 | 0.027739 | 0.236461 | 0.175129 | 0.327375 | 1.000000 | 0.256996 | 0.973679 | 0.255364 | 0.947990 |
| mobile_likes | -0.026725 | 0.014513 | 0.026725 | 0.010372 | 0.028037 | 0.235650 | 0.229800 | 0.871651 | 0.256996 | 1.000000 | 0.288512 | 0.187616 | 0.190171 |
| mobile_likes_received | -0.024253 | 0.000485 | 0.024253 | 0.006424 | 0.023965 | 0.232699 | 0.173801 | 0.329257 | 0.973679 | 0.288512 | 1.000000 | 0.209996 | 0.850490 |
| www_likes | 0.015578 | 0.009332 | -0.015578 | 0.012116 | 0.070747 | 0.229798 | 0.214017 | 0.644959 | 0.255364 | 0.187616 | 0.209996 | 1.000000 | 0.296052 |
| www_likes_received | -0.018228 | 0.002449 | 0.018228 | 0.005992 | 0.030547 | 0.220726 | 0.161437 | 0.295686 | 0.947990 | 0.190171 | 0.850490 | 0.296052 | 1.000000 |
# Plotting a heat map of the correlation matrix.
correlations = fb_df.corr()
f, ax = plt.subplots(figsize=(22, 10))
sns.heatmap(data=correlations, annot = True, cmap='viridis')
sns.despine(left=True, bottom=True)
Observations:

- At first glance, several dark squares stand out in the heatmap: 'likes' & 'mobile_likes', 'likes' & 'www_likes', 'likes_received' & 'mobile_likes_received', 'likes_received' & 'www_likes_received', 'www_likes_received' & 'mobile_likes_received', ('tenure', 'age' & 'dob_year'), and 'friend_count' & 'friendships_initiated'.
- 'likes' is strongly correlated with 'mobile_likes' and 'www_likes' because 'likes' is the sum of those two columns; likewise, 'likes_received' is the sum of 'mobile_likes_received' and 'www_likes_received'. It is therefore enough to keep 'likes' and 'likes_received' and drop the channel-specific columns.
- 'friend_count' and 'friendships_initiated' are strongly correlated, so 'friendships_initiated' is excluded. 'dob_year' is perfectly negatively correlated with 'age', and 'age' and 'tenure' are strongly correlated. 'dob_month' and 'dob_day' show no dependence on the other features, so 'dob_year', 'dob_month' and 'dob_day' are dropped as well.
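The manual inspection above can also be automated: scan the correlation matrix for pairs above a threshold and flag one member of each pair. A hedged sketch on synthetic data (the column names and the 0.8 cutoff are illustrative, not from the notebook):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'likes': a,
    'mobile_likes': a * 0.7 + rng.normal(scale=0.1, size=200),  # near-duplicate of likes
    'tenure': rng.normal(size=200),                             # independent feature
})

corr = toy.corr().abs()
# Keep the upper triangle only (k=1 excludes the diagonal) so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [col for col in upper.columns if (upper[col] > 0.8).any()]
print(redundant)  # ['mobile_likes']
```

The same scan over `fb_df.corr()` would surface the likes/mobile_likes/www_likes and likes_received groups called out above.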
df = fb_df[['age', 'tenure', 'friend_count', 'likes', 'likes_received', 'gender']].copy()  # .copy() avoids SettingWithCopyWarning
df.set_index('age', inplace=True)
corr = df.corr()
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(5, 4))
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed in NumPy >= 1.24
mask[np.triu_indices_from(mask)] = True
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
g= sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5})
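The `np.triu_indices_from` mask used above hides the diagonal and upper triangle so each correlation is drawn only once; its effect can be verified on a tiny 3x3 matrix:

```python
import numpy as np

mask = np.zeros((3, 3), dtype=bool)
mask[np.triu_indices_from(mask)] = True  # flag diagonal + upper triangle

print(mask)
# [[ True  True  True]
#  [False  True  True]
#  [False False  True]]
```

Cells where the mask is True are blanked by `sns.heatmap`, leaving only the lower triangle visible.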
sns.clustermap(corr, center=0, cmap="vlag",linewidths=.75, figsize=(4, 5))
g = sns.pairplot(fb_df[['age', 'gender', 'tenure', 'friend_count', 'friendships_initiated', 'likes']],
                 vars=['age', 'tenure', 'friend_count', 'friendships_initiated', 'likes'], hue="gender", palette="husl")
g.fig.suptitle('Pair Plot', y=1.02)  # plt.title() would label only the last subplot